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Abstract —We consider the problem of learning models for 
forecasting multiple time-series systems together with discovering 
the leading indicators that serve as good predictors for the 
system. We model the systems by linear vector autoregressive 
models (VAR) and link the discovery of leading indicators to 
inferring sparse graphs of Granger-causality. We propose new 
problem formulations and develop two new methods to learn such 
models, gradually increasing the complexity of assumptions and 
approaches. While the first method assumes common structures 
across the whole system, our second method uncovers model 
clusters based on the Granger-causality and leading indicators 
together with learning the model parameters. We study the per¬ 
formance of our methods on a comprehensive set of experiments 
and confirm their efficacy and their advantages over state-of- 
the-art sparse VAR and graphical Granger learning methods. 


I. Introduction 

In many application areas multiple indicators are monitored 
over time. For planning purposes reliable forecasts of their 
future developments are needed. While the monitored series 
are usually closely related, often there is little understanding 
of these relationships coming from the specific domain knowl¬ 
edge. 

We consider the problem of forecasting such systems of 
multiple time series from their past evolution together with 
discovering the main leading indicators within the system that 
help predicting the future values of most of the series. This 
twofold objective is driven by realistic considerations of the 
forecasting practice. 

In forecasting exercises practitioners typically endeavour to 
gather as much helpful data as possible. Many a time this 
may lead to cluttering their models with indicators with little 
predictive benefit. Discovering the leading indicators helpful 
for forecasting the whole system or its major parts may aid the 
analysts in understanding the relationships between the series 
and in simplifying their models and their forecasting systems 
accordingly (data acquisition, storage, etc.). 

While in reality the set of time series included in the 
system is certainly not exhaustive, we deliberately concentrate 
the analysis on the system series setting aside any possible 
(and likely) external effects. Developing our previous line of 
reasoning, we take it that external confounders cannot be used 
as predictors for the system either because their identity is 
unknown (a hidden confounder) and/or because their data are 
not available. On the other hand, some of the series included in 
the system may serve as surrogates for such unavailable data 


and understanding their contribution to the forecasting model 
is of utmost interest. 

By a leading indicator we understand a time series whose 
past values are useful for predicting the future of other 
time series within the system, a notion known in the time- 
series community as Granger causality (G-causality). The G- 
causality structure of a time-series system can be described by 
a directed graph where each node is a time-series and an edge 
is a directed G-causal link. 

We assume little prior knowledge about the relationships 
between the series, except that the domain knowledge supports 
our main assumptions about the existence of lagged dependen¬ 
cies between the series and the general over-specification of 
the system with only a few main leading indicators. In result, 
we expect the G-causal graph to have rather specific structure: 
sparsely connected graph with only a few nodes with out-edges 
leading to most of the other nodes in the graph. These centrally 
located nodes are the leading indicators of the system that we 
wish to discover. 

In this paper, we focus on linear vector autoregressive mod¬ 
els (VARs) that are simple yet robust models well established 
in the time-series forecasting literature, e.g. m, in particular 
for modelling stationary time series that we concentrate on 
in this work. The G-causality is naturally linked to the VAR 
models through the model parameters matrix that can be 
interpreted as the adjacency matrix of the G-causal graph. 
We therefore formulate our twofold problem of forecasting 
the system together with discovering the leading indicators as 
learning VAR models with parameters structure corresponding 
to our G-causal graph assumptions. 

Several methods have been recently proposed for learning 


has more detailed discussion of the related work). However, 
these methods neglect the possibility of shared structures in 
the lagged dependencies captured by the G-causality graph. 

The goal of this paper is to demonstrate that leveraging such 
shared G-causality structures can improve the VAR learning 
and yield better forecasting performance of the models, in 
particular, in the small-sample-large-dimensionality settings 
which are very common in real multivariate time-series fore¬ 
casting problems. 

We build on the regularized multi-task learning paradigms, 
e.g. ED, and the techniques of structured regularization, e.g. 
0, to develop two new methods for learning the G-causality 
in VARs with leading indicators (section m- The first, 


the G-causality in the VARs, e.g. [2], f3J, ||4] (section IV 



SingleCluster-VAR (SCVAR), somewhat naively assumes that 
the leading indicators are useful for forecasting the whole 
system, that is all the series within the system. The second, 
MultiCluster-VAR (MCVAR), assumes more realistically that 
there are different leading indicators useful for predicting 
different parts of the system, that is groups of the series. These 
groups are, however, a priori not known and need to be learned 
together with the model parameters. 

To the best of our knowledge these are the first G-causality 
learning methods that consider common sparse structures in 
the lagged dependencies. Also, our approach to learning clus¬ 
ters around their leading indicators together with discovering 
the graph of G-causality is novel. 

To document the validity and the power of our approach 
we test our methods on a set of artificial and real data 
experiments and compare their performance with the state- 
of-the-art methods for learning G-causality in VARs (section 
0 In the synthetic datasets coherent with our structural 
assumptions our methods clearly out-perform the state-of- 
the-art competitors. At the same time, they are robust to 
violations of such assumptions having predictive performance 
at least as good as other tested methods in such synthetic data. 
Our methods also show superior performance on several real 
datasets. 

II. Preliminaries 

A. Vector autoregressive model 

For a set of K time series observed at T synchronous 
equidistant time points, we write the VAR in the form of a 
multi-output regression problem as 


are highly over-parametrised (Kp T) and some form of 
regularisation needs to be used to condition the model learning. 

B. Granger-causality graphs 

In (2) the author defines causality based on the predictability 
of the series: we say that Z G-causes Y if we can forecast Y 
better using the history of Z than without it. This concept can 
be extended to sets of series so that a set of series {Zi,..., Zi} 
is said to G-cause series Y if Y can be better predicted using 
the past values of the set. 

The G-causal relationships can be described by a directed 
graph Q = {V. £} in which nodes represent the time series in 
the system, and a directed edge e./> from vi to Vk means that 
time series l G-causes time series k, e.g. (S). 

In VARs the G-causal relationships are captured within 
the W parameters matrix of model Q- When any of the 
parameters of the fc-th task (fc-th column of the W) referring 
to the p past values of the Z-th input series is non-zero, we 
say that the Z-th series G-causes series k, and we denote this 
in the G-causality graph by a directed edge eu,- from Vi to Vk 
(see figure [TJ. 

C. Multi-task learning 

In multi-task learning (MTL) we benefit from learning 
several tasks together (as opposed to learning the tasks in 
isolation) by allowing the models to share information and 
learn one from another. The multi-task learning methods find 
an appropriate balance between task specificity and similarity 
of the models in order to maximize the overall predictive 
performance. 


Y = XW + E, (1) 

where Y is the T x K output matrix for T instances and K 
time series as individual forecasting tasks, X is the T x Kp 
input matrix so that each row x ti . of the matrix is a Kp long 
vector with p lagged values of the K time series as inputs 
X t „ = (y t -i,i,yt-2,i, - ■ ■ ,yt-p,i,yt-i,2, ■ ■ ■ ,yt-p,K)^] and 
W is the corresponding KpxK parameters matrix where each 
column w fc is a model for a single time series forecasting 
task. We follow the standard time series assumptions: the 
T x K error matrix E is a random Gaussian noise matrix 
with i.i.d. rows e*,. ~ N( 0, X) and diagonal covariance X; 
the time series are second order stationary and centred (so that 
we can omit the intercept). 

In principle, we can estimate the model parameters by 
minimising the standard squared error loss 

T K 

I(W):=^^( 2/tjfc -(w., fe ,x t ,.)) 2 (2) 

t=l fc=l 

which corresponds to maximising the likelihood with i.i.d. 
Gaussian errors. However, since the dimensionality Kp of the 
regression problem quickly grows with the number of series 
K (by a multiple of p ), usually already relatively small VARs 

1 By convention, all vectors in this paper are column vectors. 


III. Learning VARs with leading indicators 


We develop two new methods for learning VARs with 
leading indicators. These methods follow two different struc¬ 
tural assumptions for the G-causality graphs: SingleCluster- 
VAR (SCVAR) assumes that there are leading indicators 
within the system that help predicting all the series in the 
system (assumption 1[ ) and aims at identifying those indica¬ 
tors; MultiCluster-VAR (MC) assumes that there are different 


leading indicators for different clusters of series (assumption 


|2| and aims at learning both, the leading indicators as well as 
the unknown clusters. 

Both our methods exploit the MTL ideas and intertwine the 
model learning for the system by imposing structural similarity 
constraints derived from the assumptions above. This is in 
contrast to other state-of-the-art VAR and graphical Granger 
learning methods which, albeit being usually initially formu¬ 
lated as a single multi-output problem, typically decompose 
into a set of single-output tasks (due to their simple additive 
structure) which prevents the models from sharing information 
at learning time. 

Though intertwined by the structural similarity, in both 
methods the models are allowed to diverge as per the specific 
time-series predictive task a) in their parameter values and b) 
by allowing each series to depend on its own lagged values 





(assumption 31. In this way we balance the structural similarity 
with the need for task-specificity. 

The rather restrictive structural assumption of SCVAR is 
obviously somewhat naive and limits the usefulness of the 
method to quite particular types of time-series systems. We 
present it here mainly as a stepping stone for the more general 
and flexible MCVAR which we consider to be the main 
contribution of this paper. 


5,Vfc) for assumption 1; and a fixed diagonal matrix B = tI 
for assumption 3. We tie the matrices together by the identity 
r = A diag(A) + B, where the middle term frees the 
diagonal elements from the common learning of A and thus 
leaves them to be specified by B. We set r = 1 in all the 
following to achieve identifiabilit)!^] 

We finally formulate the full optimisation problem for 
SCVAR as a constrained minimisation of loss function (|3]l 


A. SingleCluster-VAR 

We first translate the assumption 1 and [3] into the structural 
properties of the parameters matrix W of the VAR model 
< 0 - According to assumption 1 it shall be a sparse matrix 
with common non-zero blocks of rows (a block has all lags 
of a leading indicator) across all the columns (a column is a 
learning task); from assumption 3 the matrix shall have full 
block-diagonal elements (figure [lj. 




a) W matrix b) Granger-causal graph 

Fig. 1. Schematic structure of the SCVAR parameters matrix W and the 
corresponding G-causality graph for a system of K = 7 series with series 2 
and 5 being the leading indicators. In a) the number of lags is p = 3, the 
gray cells are the non-zero elements. In b) the self-loops corresponging to the 
block-diagonal elements in W are omitted for clarity of display. 

Next we bring these structural assumptions into the VAR 
learning problem. We partition the input vectors x, and the 
parameters w k in eq. 0 into I\ p-long sub-vectors x t = 

(x't, i> 2 ’ - and W A = - ,™K,kY so 

that the x tjb contains the p lagged values of series b at time t 
and wj, k has the parameters for this input sub-vector for the 
fc-th predictive task. We then decompose each of the parameter 
sub-vectors w b ,k = into the product of a non-negative 

scalar 7 b and a p-long vector v b k so that each element 7 ^ 
of the K x K matrix T is associated with one p-long sub¬ 
vector in W. Using this decomposition we rewrite the loss 
function (| 2 ) as 

T K K 

L{ r,V) :=£X>t,k-X>,jdvM.*t, 6 » * 2 . ( 3 ) 

t -1 fc=1 6=1 

where V is the Kp x I\ matrix with columns vj, : = 
(vj k , w' 2 fc) • • •, vk k Y (same shape and decomposition as 
W). It is by controlling the structure in T now that we can 
control the block-structure in W as per our assumptions above. 

Since we need to combine two structural assumptions (1 
and 3) within a single problem, we further decompose matrix 
r into two K x K matrices: a sparse matrix A with the same 
sparsity pattern (to be learned) in all its columns (o. k = 


argmin rv L(r,V), (4) 

s.t. l' K a = k; a > 0 ; ||V|||, < e 

where \\.\\r is the Frobenious norm, 1 k is a A'-long vector 
of ones, r = A — diag(A) + I, and a. tk = a for all k. 

In Q we impose a standard ridge penalty j9] on the model 
parameters V, and a simplex constrain^ on the common 
structural vector oi. It is this simplex constraint which plays 
the most important role in promoting sparsity in the final W 
matrix at the pre-specified block-level. 

In SCVAR we solve the non-convex optimisation problem 
(|4ji to find a local minimum by alternating descent for V and 
a as outlined in algorithm |T| We initialize a evenly so that 
«/, = k/K for all b. We solve the simplex constrained least 
squares problem by projected gradient descent method with 
backtracking stepsize rule iflOl. 


Algorithm 1 SCVAR 
Input: Y, X, A, k; initialize a. 

repeat 

Step 1: Solve for V 

put a. t k = ct for all columns of A 
get r = A — diag( A) + I 
re-weight input blocks = 7 b,k *t,b- 

argmin v fc ||y. )fe - Z fc v. >fc ||| + A||v. )fe ||| Vfc 
Step 2: Solve for a 

get block products h t>b> k = (v b ,/c,x tjb ) 
get residuals r t>k = y t .k ~ h t , k ,k using own history 
concatenate ht,b,k into T x K matrix H/ and replace 
fcth column in H b by zeros 

concatenate H k matrices into KT x K matrix H 
argrnin^ ||vec(R) — Ha|||, s.t. a on simplex 
until convergence 


B. MultiCluster-VAR 


In the MultiCluster-VAR we move from assumption 1 to the 
more complex (and more realistic) assumption 2 of a cluster 
structure within the system. 

If the cluster structure were known a priori, the models and 
the leading indicators could be learned by a slight modification 
of the SCVAR method above: associate each of the C clusters 
with a structural vector of and set o: b = o' for all tasks /;: 


2 Note that w bjb = ( 1 - 7 ^) (v b ,fc/r), and 7 *,,* = /3 ktk = T. 

' k controls the relative weight of each series own past vs the past of all 
the neighbouring series. 





























belonging to cluser c; in step 2 of algorithm [I] solve for each 
a c from the respective R c and H c matrices. 

In reality, however, the cluster structure is typically not 
known a priori and needs to be learned together with the model 
parameters, which is what our MCVAR method achieves. 

To bring the assumptions 2 and [3] into the VAR learn¬ 
ing of MCVAR we use the same structural matrices T = 
A — diag(A) + I as in SCVAR. However, we drop the 
constraint of strict equality between the columns of A (coming 
from assumption 1 ) and instead introduce a constraint forcing 
the columns of A to live in a low-dimensional subspace 
(the clustering assumption 2). More specifically, we factorize 
A = DG into two lower-dimensional matrices r < K: 
the K x r dictionary matrix D with the dictionary atoms 
(columns of D) representing the cluster prototypes of the 
dependency structure, and the r x K matrix G with the 
per-model dictionary weights. We further impose sparsity 
promoting simplex constraints on both the dictionary atoms in 
D and the weights of the atoms combination for each model 
in the columns of G. Figure [2] illustrates the roles of the D 
and G matrices in the low-rank decomposition of A. 



(t)d>®(b(bfetD 


Fig. 2. Schema of the role of the D and G matrices in the low-rank 
decomposition of matrix A in the MCVAR method. Imaginary system with 
K = 7 time series and rank r = 3 matrix A. The d columns are the sparse 
cluster prototypes, the non-zero elements are shaded. The numbered circles in 
the bottom are the individual task models, the arrows are the elements of the 
weight matrix G. Solid arrows have weight 1, dashed arrows have weights 
between 0 and 1. 


Using these factorization matrices, we formulate the MC¬ 
VAR optimisation problem 


argmin r y i(r, V), 
s.t. l' K d.j = k Vj; d. j > 0 Vj; 
Ks., k = 1 Vfc; g., fe > 0 Vfc; ||V||f, < e 


(5) 


where A = DG, and T = A — diag(A) + I. 

If in the above the rank is set equal to the size of the system 
(r = K ) the model learning disentangles into K independent 
learning tasks. If r = 1 the MCVAR is equivalent to our 
previous SCVAR. In this sense, MCVAR is a more versatile 
generalization of our two methods which can cater for all of 


the assumptions 1-3 by setting the value for r correspondingly. 

We find a local minimum of the non-convex problem ([5]) 
similarly as in SCVAR by alternating descent for V, D and 
G as outlined in algorithm [2] We initialise matrices D and G 
evenly so that dij = n/K and = 1 /r for all i,j, 0 is the 


Hadamard product, 0 is the Kronecker product, and vec(.) is 
the vectorization operator. 


Algorithm 2 MCVAR 

Input: Y, X, A, k\ initialize D, G. 

repeat 

Step 1: Solve for V 
get A = DG 
get r = A — diag(A) + I 
re-weight input blocks z t)6(fc = 7 fcjfc x tjb . 
argmin v k ||y. )fe - Z fe v. >fc ||§ + A||v. jfc ||| Vfc 
Step 2a: Solve for G 

get block products h tyb , k = (v b>jfc ,x tj6 ) 
get residuals r t , k = Vt,k ~ h t ,k,k using own history 
concatenate ht, b ,k into T x K matrix H/, and replace 
A'th column in H b by zeros 

argmin g ||r ^ — H b .Dg k \\%, s.t. simplex Vfc 
Step 2b: Solve for D 

concatenate H k matrices into KT x K matrix H 
get G = G' 0 l T lV, H = lj, 0 H 
argmin„ ec ( D ) ||vec(R) — G © Ht;ec(D)|||, s.t. sim¬ 
plex on columns of D 
until convergence 


To understand better the effects of our methods on the 
VAR learning we can link our formulations to some other 
standard learning problems though bearing in mind that these 
have different aims and different assumptions than ours. We 
can rewrite the inner product in the loss function 0 as 
(v b| fc, 76 ,fcXt 1 b). In this ’’feature learning" formulation, the vec¬ 
tors 7 k act as weights for the original inputs and, hence, gen¬ 
erate new (task-specific) features z t ,b,k = 7b, fc Xt,b (see also 
algorithms [I] and |2j. Alternatively, we can express the ridge 
penalty on V used in eq. 0 as ||V|||, = J2 b ,k llw.fclli = 
Y^b k VTb fellah,fclli- I n this - * adaptive ridge" formulation the 
elements of T, which in our methods we learn, act as weights 
for the t 2 regularization of W. Equivalently, we can see this 
as the Bayesian maximum-a-posteriori with Guassian priors 
where the elements of T are the learned priors for the variance 
of the model parameters or (perhaps more interestingly) the 
random errors. 


IV. Related work 

Our work falls into the category of methods for learning 
the G-causality in VARs. As shows the list of references in 
the survey working-paper HU. this has been a rather active 
research area over the last several years. 

The closest in spirit to ours are the graphical Granger 
methods, in particular the lasso-based algorithms fT 2 l for 
discovering sparse G-causality graphs. We use the two best- 
established ones, the lasso-Granger (LG) of 03 and the 
grouped-lasso-Granger (GLG) of J2, as the state-of-the-art 
competitors in our experiments. More recent adaptations of 
these address the specific problems of determining the order 
of the models and the G-causality simultaneously 0 , na, 
the G-causality inference in irregular ED and subsampled 













series ED, and in systems with instantaneous effects H3). 
However, neither of the above methods considers or exploits 
any common structures in the G-causality graphs as we do in 
our methods. 

Common structures in the dependency are assumed in lfl8l 
and ED though the common interactions are with unobserved 
variables from outside the system rather then within the 
system itself. Also, the methods discussed in these have no 
clustering ability. 120) considers common structures across 
several datasets (in panel data setting) instead of within the 
G-causality graph of a single dataset. ETI assume sparse bi¬ 
clustering of the nodes (by the in- and out- edges) to learn 
fully connected sub-graphs in contrast to our shared sparse 
structures within the G-causality graph. 

Our work builds on the standard regularized multi-task 
learning and structured regularization techniques developed 
outside the time-series settings. Similar block-decompositions 
of the feature and parameter matrices as we use in our 
methods have been proposed ll22l . l23l to promote group struc¬ 
tures across multiple models although the methods developed 
therein have no clustering capability. Various approaches for 
learning model clusters are discussed in |24j. 1125:1, (2611, 1271, 
ll28l of which the latest uses similar low-rank decomposition 
approach as our method. Nevertheless, neither of these ap¬ 
proaches learns sparse models and builds the clustering on 
similar structural assumptions as our methods do. 

V. Experiments 

We have performed a set of experiments on artificial and 
real-world datasets to document the performance of our meth¬ 
ods and their ability to recover the G-causality structures corre¬ 
sponding to our assumptions 1-3. We measure the forecasting 
accuracy by the mean squared error of 1-step-ahead forecasts 
over a set of hold-out points averaged across all the series in 
the system. 

To evaluate the ability to recover the expected structures we 
calculate the number of discovered leading indicators and the 
number of edges in the G-causality graph. In the synthetic 
experiments we calculate the accuracy of the inferred G- 
causality graph as the percentage of true-positive and true¬ 
negative edges on all edges in the graph. 

In all the experiments we compare the results of our 
methods generated from single runs (single initialisation) 
of our algorithms [I] and [2] against simple baseline models: 
constant prediction (Mean), random walk (RW), univariate 
autoregression (AR); and the state-of-the-art graphical Granger 
methods for learning sparse VARs: lasso-Granger (LG) and 
grouped-lasso-Granger (GLG). 

We summarise the experiments and their outcomes in the 
two subsections below. Further details on the experimental 
settings and the numerical results are available in the Appendix 
of the online version of this paper. 

A. Synthetic experiments 

We have created five synthetic datasets (A-E) to explore the 
behaviour of our methods under different settings. In the first 


four (A-D), we examine multiple scenarios of the G-causality 
structures as reflected by the parameter matrices W depicted 
in figure |4ji) on systems of 10 series (K = 10). The first 
two scenarios correspond to the assumptions of our methods 
while the other two are on purpose incoherent with them. 
Namely, A has the structure of SCVAR; B has the structure 
of MCVAR with 2 clusters; C is a simple AR model without 
any dependency structure in between the series; D has a fully 
connected G-causality graph. The fifth dataset (E) is to confirm 
the efficacy of our methods also in larger systems (K = 30) 
with more complicated cluster structures (3 clusters in figure 
00 ). 

All our artificial datasets have been generated from station¬ 
ary VAR models with p = 3 and zero centred Gaussian noise 
with covariance X = I. For all the experiments A-E we keep 
the last 500 data points of the series in the hold-out and train 
the models using training sets with lengths 30 to 500 data 
samples (time points). We tune the hyper-parameters by 5- 
folds inner cross-validation within the training sample^] 
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Fig. 3. Average l-step ahead forecast error for A-D scenarios of synthetic 
experiements measured by relative MSE (see text). For clarity of display Mean 
and RW methods performing worse by an order of difference are omitted from 
the plots. In A and C, MCVAR and SCVAR overlap. 






In figure [3] and table |T] we compare the predictive perfor¬ 
mance of the tested methods in terms of mean squared error 
of 1-step ahead predictions averaged across the 500 hold-out 
points and the 10 series in the systems A-D. We use the 
relative MSE as compared to the forecast error of the VARs 
with the tme parameter matrices used for generating the data 
MSE re i ative = MSE method /MSE true . The violet (SCVAR) 
and green (MCVAR) lines are always below the red (LG) and 

4 We use a 10-point logarithmic grid in the interval [10 4 ,1] for A and 
4-point logarithmic grid between [10 2 ,10] for k. 














TABLE I 

Synthetic experiments: relative MSE for 1-step ahead 

FORECASTS (HOLD-OUT SAMPLE AVERAGE) 


Size 

30 

50 

75 

100 

200 

500 

Mean 

6.81 

6.81 

6.81 

6.81 

6.81 

6.81 

RW 

8.90 

8.90 

8.90 

8.90 

8.90 

8.90 

AR 

3.97 

3.86 

3.94 

3.96 

3.67 

3.62 

LG 

2.28 

1.79 

1.47 

1.36 

1.16 

1.06 

GLG 

2.40 

1.76 

1.49 

1.38 

1.15 

1.05 

SCVAR 

1.89 

+++++ 

1.38 

+++++ 

1.15 

1.15 

1.06 

1.03 






MCVAR 

1.84 

+++++ 

1.39 

+++++ 

1.14 

1.17 

1.06 

1.03 

+++== 





Mean 

3.18 

3.18 

3.18 

3.18 

3.18 

3.18 

RW 

4.56 

4.56 

4.56 

4.56 

4.56 

4.56 

AR 

1.98 

1.72 

1.66 

1.66 

1.64 

1.62 

LG 

1.96 

1.26 

1.14 

1.11 

1.06 

1.03 

GLG 

1.78 

1.25 

1.14 

1.13 

1.08 

1.02 

SCVAR 

1.45 

+++++ 

1.18 

+++++ 

1.09 

+++++ 

1.06 

1.04 

1.01 

++++= 




MCVAR 

1.58 

+++++ 

1.14 

+++++ 

1.06 

1.05 

1.03 

1.01 

++++= 





Mean 

1.63 

1.63 

1.63 

1.63 

1.63 

1.63 

RW 

3.55 

3.55 

3.55 

3.55 

3.55 

3.55 

AR 

1.11 

1.07 

1.03 

1.03 

1.01 

1.01 

LG 

1.47 

1.23 

1.12 

1.10 

1.07 

1.02 

GLG 

1.45 

1.35 

1.22 

1.13 

1.05 

1.02 

SCVAR 

1.14 

1.12 

1.04 

1.04 

1.02 

1.01 


++=++ 

++=++ 

++=++ 

++=++ 

++=++ 

++=++ 

MCVAR 

1.14 

1.12 

1.04 

1.04 

1.02 

1.01 


++=++ 

++=++ 

++=++ 

++=++ 

++=++ 

++=++ 

Mean 

6.56 

6.56 

6.56 

6.56 

6.56 

6.56 
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AR 

4.02 
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3.58 
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3.44 

3.41 

LG 
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1.45 
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1.69 
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Signs below SCVAR and MCVAR methods indicate if the result is signifi¬ 
cantly better (+), worse (-), or neither (=) as compared to the other methods 
(in order of the rows in the table) using a one-sided paired-sample t-test at 
5% significance level. 


yellow (GLG) lines in experiments A-C and the difference in 
the performance is statistically significant at 5% significance 
level at all the points except 2 (SCVAR and MCVAR vs GLG 
in experiment B with 500 sample size). In the D experiments, 
all the methods perform comparably with our methods being 
significantly better 5 times, significantly worse 5 times and 
without a significant difference in the remaining 14 cases (our 
2 methods compared vs LG and GLG for 6 sample sizes). 
These results confirm the utility of our methods for improving 
forecasting performance and show that our methods behave 
well even in systems which violate their initial assumptions. 


In all the experiments the relative MSE decreases with the 
increasing training size and eventually all the methods (except 
AR in experiments A,B,D) have predictive performance near 
the true models. In accordance with our goals, it is in the 
small-sample settings where the performance of the models 
differs the most and where the advantage of our methods is 
the most prominent. 


a) True 
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b) MCVAR 
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Fig. 4. Heatmaps of the VAR parameter matrices W for four scenarios of 
synthetic experiments A-D. a) True parameters used for VAR simulations; b)- 
d) learned parameters by the respective methods with training samples fixed 
to 100 instances. 


In addition to predictive performance, we monitor the 
performance of our and the competing methods in terms of 
our second objective: the recovery of leading indicators and 
the cluster structure in the G-causality graph. Figure [4] gives 
examples of the parameter matrices W learned by the tested 
methods in comparison to the true parameters from which 
the data of the four synthetic experiments A-D have been 
generated. From a visual inspection, the MCVAR seems to 
be closer to the True than the other two methods, mainly in 
the A and B experiments. 

We provide a more systematic quantitative evaluation in 
figure [5] where we compare the Granger accuracy of the 
methods for different training set sizes across the four experi¬ 
ments. By Granger accuracy we mean the accuracy in correctly 
identifying the edges in the G-causality graph underlying 
the learned VAR model (accuracy = (true positive + true 
negative)/(true positive + true negative + false positive + 
false negative)). In the A and B experiments coherent with 
the initial assumptions of our methods, the violet (SCVAR) 
and green (MCVAR) lines are always above the red (LG) 
and yellow (GLG) lines which struggle to recover the correct 
structures even for larger samples. In the other 2 experiments, 
the performance of all the methods is comparable. The simple 
baseline AR model is included in the graphs for consistency 
reasons though it does not actually discover the structure but 
has it fixed by construction. 

For the E dataset, the experimental results are summarised 
in figure [6] using the same metrics as above and confirming 
the superiority of our methods over its competitors in terms 















































































Fig. 5. Accuracy of the recovered G-causal graphs for A-D scenarios of 
synthetic experiements. For clarity of display and consistency with figure [5] 
Mean and RW methods are omitted from the plots. In C. MCVAR and SCVAR 
overlap. 


of both our objectives, predictive performance (significantly 
better at 5% sig. level in all 24 cases) as well as the G-causality 
recovery. 



Fig. 6. Forecasting performance and the accuracy of the recovered G-causal 
graphs for scenario E of the synthetic experiments. For clarity of display Mean 
and RW methods performing worse by an order of difference are omitted from 
the plots. 

Though the structure of system E is clearly more com¬ 
plicated than what the SCVAR method assumes, we have 
included it into the evaluation. Somewhat surprisingly, it 
performs almost as good as the more complex MCVAR. This 
is in part due to the overall strong sparsity of the true models. 
The SCVAR therefore suffers only little degradation from 
including all the leading indicators into all the models due 
to the shrinking effect of the t> regularization on such miss- 
specified model parameters. It is also in part due to the 3rd 
cluster (emphasized in figure [7] in red) which is particularly 


a) True b) SCVAR c) MCVAR _ d) LG _ e) GLG 



Fig. 7. Heatmaps of the VAR parameter matrices W for scenario E of the 
synthetic experiments (see text), a) True parameters used for VAR simulations; 
b)-e) learned parameters by the respective methods with training samples fixed 
to 100 instances. The red rectangle emphasizes the 3rd model cluster. 

difficult to discern because of very low parameter values of 
its leading indicator. 

B. Real-data experiments 

For the real-data experiments we use two sources of data 
very different in nature in order to verify the performance 
of our methods in handling more diverse problems. The first 
is the database of the Water Services of the US geological 
survey (http://www.usgs.gov/) providing (amongst other) the 
daily averages on water physical discharge along the river 
stream^] The second is a dataset of major US quarterly macro- 
economic indicators of |29l (full list in the Appendix of the 
online version). 

Out of these data-sources, we have constructed four ex¬ 
perimental datasets, three from the water data and one from 
the economic data: Connecticut with data from K = 8 
measurement sites along the river (650km long in the north¬ 
east of the US); Yellowstone with data from K = 7 sites (over 
1000km long in the west of the US); Two rivers combining 
these two into a single system with K = 15; Economic with 
I\ = 20 major macro-economic indicators. We preprocessed 
all the datasets by standard time-series transformations to 
achieve stationaritj^j and by normalizing. 

The forecasting objective for all of these experiments is to 
predict the values of the series in the next future point, the next 
day discharge at all of the river sites and the values of the full 
set of the macro-economic indicators in the next quarter. We 
used a hold-out of the last 100 observations of the series (50 
for the Economic data), set p = 5 (p = 3 for Economic) 
and trained the models on the previous 30-300 observations 
with 5-folds inner cross-validation within the training sets for 
tuning the hyper-parameters. 

We provide the summary of the results of the four ex¬ 
periments in figure [8] The left column shows the relative 
MSE of the tested methods as compared to the baseline 
RW model (below 1 means better than RW). The two other 

5 USGS parameter code 00060 - physical discharge in cubic feet per second. 

Calculating the logs of year-on-year growth for the water data and by 
differencing or log-differencing for the economic data. Full list of applied 
transformations is in the Appendix. 
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Fig. 8. Forecasting accuracy and the sparsity of the G-causality graphs 
for the 3 river-flow and the Economic data experiments. The first column: 
average 1-step ahead forecast error measured by relative MSE; the second 
column: number of edges in the G-causality graph; the third column: number 
of leading indicators in the G-causality. For clarity of display the Mean method 
performing worse by an order of difference in all the experiments is omitted 
from the plots. 


columns summarise the sparsity of the G-causality graphs in 
terms of the number of edges and the number of discovered 
leading indicators. Our methods (violet and green lines) have 
overall better forecasting performance than the state-of-the- 
art G-causality learning methods (significantly better at 5% 
significance 31 times, worse 6 times, no difference 39 times). 
At the same time, our methods learn much sparser G-causality 
graphs uncovering useful structures within the systems as 
illustrated in figure [9] 


a) SCVAR b) MCVAR c) LG d) GLG 



Fig. 9. Heatmaps of the learned VAR parameter matrices W for the Two 
rivers experiment with training samples fixed to 100. The measurement sites 
are ordered in the datasets from the top to the bottom following the rivers 
from their streams to their mouths. The blue and red rectangles separate the 
Connecticut and Yellowstone clusters. 


TABLE II 

Real-data experiments: relative MSE for 1-step ahead 

FORECASTS (HOLD-OUT SAMPLE AVERAGE) 


Size 

30 

50 

75 

100 

300 


Mean 

2.29 

2.29 

2.29 

2.29 

2.29 


AR 

0.86 

0.85 

0.83 

0.84 

0.83 
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0.71 

0.71 

0.70 

0.61 

0.59 

o 
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0.79 

0.82 

0.64 
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0.60 
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0.72 

0.74 
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AR 
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0.88 
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GL 
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0.87 

0.78 

0.68 
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6.93 

6.93 

6.93 

6.93 

6.93 


AR 

0.97 

0.99 

0.98 

0.96 

0.93 

C/3 

J-H 

GL 

0.87 

0.81 

0.90 

0.71 

0.70 

> 

GLG 

1.37 

1.05 

0.95 

0.76 

0.70 

O 

£ 

SCVAR 
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0.69 
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MCVAR 
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30 

50 

75 
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0.60 

0.60 

0.60 
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AR 

0.48 

0.46 

0.46 

0.47 
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0.49 

0.52 

0.51 

0.48 
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Signs below SCVAR and MCVAR methods indicate if the result is signifi¬ 
cantly better (+), worse (-), or neither (=) as compared to the other methods 
(in order of the rows in the table) using a one-sided paired-sample t-test at 
5% significance level. 


VI. Conclusions 

We have developed two new methods for learning sparse 
VAR models with shared structures in their Granger causality 
graphs based on the leading indicators of the system, a 
problem that had not been previously addressed by the G- 
causality learning methods. 

The first method, the somewhat simpler and more naive 
SCVAR, serves as a building stone for the more complex and 
versatile MCVAR which achieves several learning objectives 
simultaneously: good forecasting performance of the models, 
discovery of clusters within the G-causality graphs and of their 
leading indicators. 

We have confirmed the efficacy of our methods in a series 
of synthetic and real-data experiments where our methods 
systematically outperformed the state-of-the-art methods in 
both the predictive performance and the sparsity of solutions. 

Extensions of our methods worth exploring in future work 
are the analysis of contemporaneous dependencies, relaxation 
of the stationarity assumption and non-linearities in the de¬ 
pendencies and the time-series generative processes. 
























































































References 

[1] H. Liitkepohl, New introduction to multiple time series analysis. 
Springer-Verlag Berlin Heidelberg, 2005. 

[2] A. C. Lozano, N. Abe, Y. Liu, and S. Rosset, “Grouped graphical 
Granger modeling for gene expression regulatory networks discovery.” 
Bioinformatics (Oxford, England), vol. 25, no. 12, pp. 110-118, jun 
2009. 

[3] A. Shojaie and G. Michailidis, “Discovering graphical Granger causality 
using the truncating lasso penalty.” Bioinformatics (Oxford, England), 
vol. 26, no. 18, sep 2010. 

[4] J. Songsiri, “Sparse autoregressive model estimation for learning 
Granger causality in time series,” in Proceedings of the 38th IEEE 
International Conference on Acoustics, Speech, and Signal Processing 
(ICASSP), no. 1, 2013, pp. 3198-3202. 

[5] T. Evgeniou and M. Pontil, “Regularized multi-task learning,” Proceed¬ 
ings of the 10th ACM S1GKDD, pp. 109-117, 2004. 

[6] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, “Structured Sparsity 
through Convex Optimization,” Statistical Science, vol. 27, no. 4, pp. 
450^168, nov 2012. 

[7] C. W. J. Granger, “Investigating Causal Relations by Econometric 
Models and Cross-spectral Methods,” Econometrica: Journal of the 
Econometric Society, vol. 37, no. 3, pp. 424-438, 1969. 

[8] M. Eichler, “Graphical modelling of multivariate time series,” Probabil¬ 
ity Theory and Related Fields, vol. 153, no. 1-2, pp. 233-268, 2012. 

[9] A. E. Hoerl and R. W. Kennard, “Ridge regression: biased estimation 
for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55-67, 
1970. 

[10] A. Beck and M. Teboulle, “Gradient-based algorithms with applications 
to signal recovery,” Convex Optimization in Signal Processing and 
Communications, 2009. 

[11] Y. Liu and M. Bahadori, “A Survey on Granger Causality: A Compu¬ 
tational View,” 2012. 

[12] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Jour¬ 
nal of the Royal Statistical Society. Series B, vol. 58, no. 1, pp. 267-288, 
1996. 

[13] A. Arnold, Y. Liu, and N. Abe, “Temporal causal modeling with 
graphical granger methods,” Proceedings of the 13th ACM S1GKDD 
international conference on Knowledge discovery and data mining - 
KDD ’ 07, p. 66, 2007. 

[14] Y. Ren, Z. Xiao, and X. Zhang, “Two-step adaptive model selection for 
vector autoregressive processes,” Journal of Multivariate Analysis, vol. 
116, pp. 349-364, apr 2013. 

[15] M. T. Bahadori and Y. Liu, “On Causality Inference in Time Series,” 
2012 AAAI Fall Symposium Series, pp. 8-13, 2012. 

[16] M. Gong, K. Zhang, D. Tao, R Geiger, and I. Systems, “Discovering 
Temporal Causal Relations from Subsampled Data,” in ICML, vol. 37, 
2015. 

[17] J. Peters, D. Janzing, and B. Scholkopf, “Causal Inference on Time 
Series using Restricted Structural Equation Models,” in Advances in 
Neural Information Processing Systems 26 (NIPS 2013), 2013. 

[18] A. Jalali and S. Sanghavi, “Learning the dependence graph of time series 
with latent factors,” in International Conference on Machine Learning 
(ICML), 2012, pp. 473-480. 

[19] P. Geiger, K. Zhang, M. Gong, D. Janzing, and B. Scholkopf, “Causal 
Inference by Identification of Vector Autoregressive Processes with 
Hidden Components,” International Conference on Machine Learning 
(ICML), vol. 37, 2015. 

[20] J. Songsiri, “Learning Multiple Granger Graphical Models via Group 
Fused Lasso,” in IEEE Asian Control Conference (ASCC), 2015. 

[21] T. Huang and J. Schneider, “Learning bi-clustered vector autoregressive 
models,” in Machine Learning and Knowledge Discovery in Databases. 
Springer Berlin Heidelberg, 2012, pp. 741-756. 

[22] A. Argyriou, T. Evgeniou, and M. Pontil, “Convex multi-task feature 
learning,” Machine Learning, vol. 73, no. 3, pp. 243-272, jan 2008. 

[23] G. Swirszcz and A. C. Lozano, “Multi-level Lasso for sparse multi-task 
regression,” in International Conference on Machine Learning (ICML), 
2012, pp. 361-368. 

[24] B. Bakker and T. Heskes, “Task clustering and gating for bayesian 
multitask learning,” Journal of Machine Learning Research, vol. 4, pp. 
83-99, 2003. 

[25] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram, “Multi-task learning 
for classification with Dirichlet process priors,” Journal of Machine 
Learning Research, vol. 8, pp. 35-63, 2007. 


[26] L. Jacob, F. Bach, and J. P. Vert, “Clustered multi-task learning: A 
convex formulation,” in Advances in Neural Information Processing 
Systems (NIPS), 2009, pp. 1-8. 

[27] Z. Kang, L. Angeles, K. Grauman, F. Sha, L. Angeles, K. Grauman, and 
F. Sha, “Learning with whom to share in multi-task feature learning,” in 
Proceedings of the 28th International Conference on Machine Learning, 
no. 4, 2011, pp. 4—5. 

[28] A. Kumar and Hal Daume III, “Learning task grouping and overlap in 
multi-task learning,” in International Conference on Machine Learning 
(ICML), 2012. 

[29] J. H. Stock and M. W. Watson, “Generalized Shrinkage Methods for 
Forecasting Using Many Predictors,” Journal of Business & Economic 
Statistics, vol. 30, no. 4, pp. 481^-93, oct 2012. 



